With our data, we wanted to examine the relationship between the price of Airbnb properties and the neighborhoods they are listed in. More specifically, we wanted to see if our model could predict a neighborhood based on price and a few other variables we chose later. We thought this would be an interesting dataset to explore given the growth of the tourism industry as vaccination rates rise and cities start to open up.
We also wanted a way to predict the ratings of Airbnb properties. We thought this would be a good addition to our project, since a good deal on an Airbnb does not matter much if the experience is bad. Ideally, we want a way to find a good deal that also has a high rating.
Our dataset was obtained from Kaggle, but the data itself was sourced from Inside Airbnb.
In our final model, price was used 100% of the time, which is fitting since our main goal was to examine the relationship between price and neighborhood.
The first thing we did was read in our data.
boston <- read.csv("Boston_Airbnb_copy.csv")
Next we subsetted the columns we thought would make for interesting predictions, renamed columns so that they would be easier to call, and converted 'price' to a numeric variable.
boston <- boston %>%
  select(c("name", "neighbourhood_cleansed", "latitude", "longitude", "room_type",
           "accommodates", "price", "review_scores_rating",
           "host_is_superhost", "property_type"))

boston <- boston %>%
  rename(superhost = host_is_superhost, neighborhood = neighbourhood_cleansed)

# strip the leading "$" so price can be treated as a number
boston$price <- as.numeric(gsub("\\$", "", boston$price))
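As a quick sanity check on a toy vector (hypothetical values, not from the dataset): stripping only the "$" works for prices under $1,000, but listings at $1,000 or more often include a comma, which would silently become `NA`:

```r
raw_prices <- c("$85.00", "$209.00", "$1,250.00")

# removing only the dollar sign leaves "1,250.00", which as.numeric turns into NA
as.numeric(gsub("\\$", "", raw_prices))

# stripping both "$" and "," converts all three cleanly
as.numeric(gsub("[$,]", "", raw_prices))  # 85 209 1250
```

If any prices in the real data exceed $999, the `[$,]` pattern is the safer choice.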
Here we looked at neighborhood counts and saw that the most popular neighborhoods are Jamaica Plain, South End, Back Bay, Fenway, Dorchester, and Allston.
We thought grouping by neighborhood would be an easy way to do some basic exploration.
table(boston$neighborhood)
##
## Allston Back Bay Bay Village
## 260 302 24
## Beacon Hill Brighton Charlestown
## 194 185 111
## Chinatown Dorchester Downtown
## 71 269 172
## East Boston Fenway Hyde Park
## 150 290 31
## Jamaica Plain Leather District Longwood Medical Area
## 343 5 9
## Mattapan Mission Hill North End
## 24 124 143
## Roslindale Roxbury South Boston
## 56 144 174
## South Boston Waterfront South End West End
## 83 326 49
## West Roxbury
## 46
## could be interesting to compare to means
bost_med_table <- boston %>%
  group_by(neighborhood) %>%
  summarise(medianReview = median(review_scores_rating, na.rm = TRUE),
            medianPrice = median(price, na.rm = TRUE))
bost_med_table
## # A tibble: 25 × 3
## neighborhood medianReview medianPrice
## <chr> <dbl> <dbl>
## 1 Allston 94 85
## 2 Back Bay 93 209
## 3 Bay Village 93.5 206.
## 4 Beacon Hill 95 195
## 5 Brighton 94 90
## 6 Charlestown 96 178.
## 7 Chinatown 95 219
## 8 Dorchester 93 72
## 9 Downtown 94 225
## 10 East Boston 92 99
## # … with 15 more rows
plot <- ggplot(data = bost_med_table, aes(x = neighborhood, y = medianPrice, fill = medianPrice)) +
  scale_fill_gradient(low = "dark red", high = "cornflowerblue") +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Neighborhood", y = "Median Price of Airbnb per Night",
       title = "Distribution of Airbnb prices per night over Neighborhoods in Boston")
plot
boston %>%
  group_by(neighborhood) %>%
  select(room_type) %>%
  table()
## Adding missing grouping variables: `neighborhood`
## room_type
## neighborhood Entire home/apt Private room Shared room
## Allston 98 156 6
## Back Bay 263 36 3
## Bay Village 20 4 0
## Beacon Hill 155 36 3
## Brighton 75 103 7
## Charlestown 68 42 1
## Chinatown 62 8 1
## Dorchester 66 195 8
## Downtown 144 24 4
## East Boston 70 77 3
## Fenway 208 73 9
## Hyde Park 6 24 1
## Jamaica Plain 157 181 5
## Leather District 3 2 0
## Longwood Medical Area 4 4 1
## Mattapan 3 21 0
## Mission Hill 48 68 8
## North End 119 21 3
## Roslindale 19 37 0
## Roxbury 58 81 5
## South Boston 102 69 3
## South Boston Waterfront 71 12 0
## South End 250 69 7
## West End 43 6 0
## West Roxbury 15 29 2
We see here that the top 100 most expensive Airbnbs are fairly evenly distributed across the neighborhoods. Back Bay has the highest count with 17, but not so high that it is an outlier, which also makes sense because Back Bay was one of the most popular neighborhoods in the first place.
boston_top100 <- boston %>%
  arrange(desc(price))

head(boston_top100, 100) %>%
  select(neighborhood) %>%
  table()
## .
## Allston Back Bay Bay Village
## 3 17 5
## Beacon Hill Brighton Charlestown
## 9 3 4
## Downtown Fenway Jamaica Plain
## 3 6 11
## Mission Hill North End Roxbury
## 1 2 7
## South Boston South Boston Waterfront South End
## 8 7 13
## West End
## 1
First we picked which variables we thought would do the best job of predicting neighborhood. We chose price, rating, and room type, since that is where we saw the greatest variation by neighborhood in our exploration.
clust_boston <- boston[, c("price", "review_scores_rating", "room_type")]
Next, we formatted room_type to be usable in clustering.
table(clust_boston$room_type)
##
## Entire home/apt Private room Shared room
## 2127 1378 80
clust_boston$room_type <- fct_collapse(clust_boston$room_type,
                                       v1 = "Entire home/apt",
                                       v2 = "Private room",
                                       v3 = "Shared room")
clust_boston$room_type <- as.numeric(gsub("v", "", clust_boston$room_type))
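On a toy vector (hypothetical values), this collapse-then-strip encoding maps the three room types to 1, 2, and 3; since factor levels sort alphabetically, `as.numeric(as.factor(...))` would give the same codes:

```r
library(forcats)

rooms <- c("Entire home/apt", "Private room", "Shared room", "Private room")

# the same collapse-and-strip encoding used above
v <- fct_collapse(rooms,
                  v1 = "Entire home/apt",
                  v2 = "Private room",
                  v3 = "Shared room")
codes <- as.numeric(gsub("v", "", v))
codes  # 1 2 3 2

# equivalent shortcut, relying on alphabetical level order
identical(codes, as.numeric(as.factor(rooms)))  # TRUE
```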
There are a few NAs in the price and rating columns, so we replaced them with the column medians.
clust_boston$review_scores_rating[is.na(clust_boston$review_scores_rating)] <- median(clust_boston$review_scores_rating, na.rm=T)
clust_boston$price[is.na(clust_boston$price)] <- median(clust_boston$price, na.rm=T)
sum(is.na(clust_boston$review_scores_rating))
## [1] 0
sum(is.na(clust_boston$price))
## [1] 0
And then we scaled our numeric variables with min-max normalization.
normalize <- function(x){
(x - min(x)) / (max(x) - min(x))
}
clust_boston[1:2] <- lapply(clust_boston[1:2], normalize)
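On a toy vector (hypothetical nightly prices), `normalize` maps the minimum to 0 and the maximum to 1, which keeps price and rating on comparable scales for k-means:

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# hypothetical nightly prices
x <- c(50, 100, 150, 250)
normalize(x)  # 0.00 0.25 0.50 1.00
```

Note that this version assumes the NAs were already imputed (as above) and that `max(x) > min(x)`; a constant column would produce division by zero.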
Then we sorted the neighborhoods into 2 groups, central Boston and Boston suburbs.
boston$neighborhood_groups <- fct_collapse(boston$neighborhood,
Suburbs = c("Jamaica Plain", "Roslindale",
"Dorchester","Roxbury",
"West Roxbury", "Hyde Park",
"Mattapan", "Brighton", "Allston"),
Central_Boston = c("Bay Village", "Back Bay",
"Beacon Hill", "West End",
"North End", "Downtown",
"South End", "Chinatown",
"Leather District", "Fenway",
"Mission Hill", "Longwood Medical Area",
"South Boston", "South Boston Waterfront",
"Charlestown", "East Boston"))
And we used the elbow method to figure out how many clusters to choose. Based on this graph, it looks like 2 clusters will give us the best model without over-fitting, which also makes sense because we collapsed the neighborhoods into 2 groups.
explained_variance = function(data_in, k){
  set.seed(1)
  kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 30)
  var_exp = kmeans_obj$betweenss / kmeans_obj$totss
  var_exp
}
explained_var_boston = sapply(1:10, explained_variance, data_in = clust_boston)
explained_var_boston
## [1] -7.690460e-15 8.696275e-01 8.696275e-01 9.369267e-01 8.906499e-01
## [6] 8.988897e-01 8.991225e-01 9.664217e-01 9.768941e-01 9.801408e-01
elbow_boston = data.frame(k = 1:10, explained_var_boston)
ggplot(elbow_boston,
aes(x = k,
y = explained_var_boston)) +
geom_point(size = 4) +
geom_line(size = 1) +
xlab('k') +
ylab('Inter-cluster Variance / Total Variance') +
theme_light()
So we ran the clustering algorithm with 2 clusters.
set.seed(123)
kmeans_obj_boston = kmeans(clust_boston, centers = 2,
algorithm = "Lloyd")
kmeans_obj_boston
## K-means clustering with 2 clusters of sizes 2127, 1458
##
## Cluster means:
## price review_scores_rating room_type
## 1 0.21318566 0.9089093 1.00000
## 2 0.08432192 0.8989626 2.05487
##
## Clustering vector:
##    [1] 1 2 2 2 2 2 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 1 1 2 2 1 2
##   ... (remaining 3,548 cluster assignments omitted for brevity)
##
## Within cluster sum of squares by cluster:
## [1] 43.9276 102.5454
## (between_SS / total_SS = 87.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# cluster assignment for each listing, as a factor
neighborhood_clusters <- as.factor(kmeans_obj_boston$cluster)
ggplot(boston, aes(x = price,
y = review_scores_rating,
color = neighborhood_groups,
shape = neighborhood_clusters)) +
geom_point(size = 2) +
ggtitle("Price vs Rating of Boston Airbnbs") +
xlab("Price per Night") +
ylab("Review Score (out of 100)") +
scale_shape_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2"),
values = c("1", "2")) +
theme_light()
We thought a map showing the physical location of each property would be the best way to visualize our model, so we appended the clusters to our dataset and installed the required packages. (Note that ggmap's `get_googlemap` requires a Google Maps API key, registered with `register_google()`.)
boston$clusters <- neighborhood_clusters
Since this map is not interactive, we adjusted the center of the map to fit all of the properties and zoomed as far as we could without Airbnbs getting cut off. Here, of the 3585 observations, only 5 did not fit in the map.
map1 <- ggmap(get_googlemap(center = c(lon = -71.0759, lat = 42.319),
zoom = 12, scale = 2,
maptype ='terrain',
color = 'color'))+
geom_point(aes(x = longitude, y = latitude, colour = clusters), data = boston, size = 0.5) +
theme(legend.position="bottom")
map1
## Warning: Removed 5 rows containing missing values (geom_point).
We ran the map again, this time zooming in on central Boston to get a clearer look. Only 1,747 of the properties fit within this zoomed view, but we see that our model does a pretty good job of identifying Airbnbs in Central Boston.
map_zoomed <- ggmap(get_googlemap(center = c(lon = -71.0759, lat = 42.35101),
zoom = 14, scale = 2,
maptype ='terrain',
color = 'color'))+
geom_point(aes(x = longitude, y = latitude, colour = clusters), data = boston, size = 0.5) +
theme(legend.position="bottom")
map_zoomed
## Warning: Removed 1838 rows containing missing values (geom_point).
Next we wanted to get a confusion matrix based on clusters to check our accuracy.
We started by making the neighborhood groups and cluster assignments factors.
clust_boston$neighborhood_groups <- boston$neighborhood_groups
clust_boston$clusters <- neighborhood_clusters
clust_boston[,c(4,5)] <- lapply(clust_boston[,c(4,5)], as.factor)
And then we partitioned the data into train, tune, and test sets, and assigned our features and target.
train_index <- createDataPartition(clust_boston$neighborhood_groups,
                                   p = .7,
                                   list = FALSE,
                                   times = 1)
train <- clust_boston[train_index, ]
tune_and_test <- clust_boston[-train_index, ]

tune_and_test_index <- createDataPartition(tune_and_test$neighborhood_groups,
                                           p = .5,
                                           list = FALSE,
                                           times = 1)
tune <- tune_and_test[tune_and_test_index, ]
test <- tune_and_test[-tune_and_test_index, ]

features <- as.data.frame(train[, -c(4)])  # drop the target column (neighborhood_groups)
target <- train$neighborhood_groups
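The same 70/15/15 split can be sketched in base R (a hypothetical illustration of the logic only; caret's `createDataPartition` additionally stratifies so the Suburbs/Central_Boston proportions are preserved in each split):

```r
set.seed(123)
n <- 200            # hypothetical number of rows
idx <- sample(n)    # shuffle row indices

train_idx <- idx[1:floor(0.7 * n)]                      # first 70%
rest      <- idx[(floor(0.7 * n) + 1):n]                # remaining 30%
tune_idx  <- rest[1:floor(0.5 * length(rest))]          # half of the rest
test_idx  <- rest[(floor(0.5 * length(rest)) + 1):length(rest)]

length(train_idx)  # 140
length(tune_idx)   # 30
length(test_idx)   # 30
```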
And finally we ran the model to get our confusion matrix, and checked the variable importance.
set.seed(123)
boston_dt <- train(x=features,
y=target,
method="rpart")
varImp(boston_dt)
## rpart variable importance
##
## Overall
## price 100.00
## room_type 57.27
## clusters 56.66
## review_scores_rating 0.00
dt_predict_1 = predict(boston_dt,tune,type= "raw")
confusionMatrix(as.factor(dt_predict_1),
as.factor(tune$neighborhood_groups),
dnn=c("Prediction", "Actual"),
mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction Suburbs Central_Boston
## Suburbs 141 71
## Central_Boston 63 263
##
## Accuracy : 0.7509
## 95% CI : (0.7121, 0.7869)
## No Information Rate : 0.6208
## P-Value [Acc > NIR] : 1.015e-10
##
## Kappa : 0.475
##
## Mcnemar's Test P-Value : 0.5454
##
## Sensitivity : 0.6912
## Specificity : 0.7874
## Pos Pred Value : 0.6651
## Neg Pred Value : 0.8067
## Prevalence : 0.3792
## Detection Rate : 0.2621
## Detection Prevalence : 0.3941
## Balanced Accuracy : 0.7393
##
## 'Positive' Class : Suburbs
##
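The headline metrics follow directly from the confusion matrix counts. As a quick check, hard-coding the four cell counts from the output above:

```r
# counts from the confusion matrix above (rows = Prediction, cols = Actual)
TP <- 141  # predicted Suburbs, actually Suburbs ('positive' class)
FP <- 71   # predicted Suburbs, actually Central_Boston
FN <- 63   # predicted Central_Boston, actually Suburbs
TN <- 263  # predicted Central_Boston, actually Central_Boston

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

round(c(accuracy, sensitivity, specificity), 4)  # 0.7509 0.6912 0.7874
```

These match the `confusionMatrix` output, confirming how each metric is derived.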
Our accuracy is about 75%, which isn't the best, but we think our model did a good job of predicting the neighborhood with the information it was given. We think that if we could have included more of the information in our original dataset, such as amenities, parking availability, and transit information, our model would likely have been more accurate, and we could have even split our neighborhood groups further (perhaps downtown, central, and suburbs). Unfortunately, these variables were free-text descriptions written by the host, and were difficult to sort through and use in our clustering model.
The variable importance output shows that price was by far the most important variable for predicting neighborhood, which is the relationship we wanted to explore from the beginning. Using our map, we think our model is a valuable tool for finding Airbnbs that could be considered a good deal. Since price was used 100% of the time to predict our clusters, we can draw conclusions about the price of an Airbnb relative to its neighborhood. Any Airbnb that is actually located in Central Boston but predicted as the Suburbs likely has a price much lower than similar Airbnbs in Central Boston. The opposite also holds: incorrectly classified properties in the Suburbs are likely more expensive than their neighbors.